Big Data world: current state

- Problem
More data sources
More storage systems
More processing engines/tools
More application components
More dependencies between applications

[Diagram: data sources, storage systems and interdependent applications]
Orchestration in Big Data


- Orchestration 
Way of connecting components to 
provide appropriate scheduling and 
interaction 
Example: sequence of Spark jobs, 
connected by inputs/outputs 
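
To make the "jobs connected by inputs/outputs" idea concrete, here is a minimal, dependency-free sketch. The job names and datasets are hypothetical, and this is not how Airflow itself is implemented; it only shows how an execution order can be derived once each job declares what it reads and writes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs: each one declares the datasets it reads and writes.
JOBS = {
    "ingest":    {"inputs": [],               "outputs": ["raw_events"]},
    "clean":     {"inputs": ["raw_events"],   "outputs": ["clean_events"]},
    "aggregate": {"inputs": ["clean_events"], "outputs": ["daily_report"]},
    "score":     {"inputs": ["clean_events"], "outputs": ["scores"]},
}


def execution_order(jobs):
    """Return job names ordered so that producers run before consumers."""
    # Map every dataset to the job that produces it.
    producer = {out: name for name, job in jobs.items() for out in job["outputs"]}
    # For each job, collect the jobs it depends on (its predecessors).
    graph = {
        name: {producer[i] for i in job["inputs"] if i in producer}
        for name, job in jobs.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(JOBS))
```

An orchestrator adds scheduling, retries and monitoring on top of this ordering, which is the gap the rest of the deck addresses.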


[Diagram: data sources feeding several applications]
Agenda

Our project
Orchestration in Big Data world
Apache Airflow overview
Production use cases
Issues, caveats and tips
Extensions
Resources
Our project

Digital advertising in the cloud: analytics,
reporting, suspicious activity detection,
ML models

30 TB of raw data daily

15 billion input events daily

Various SLAs, from 10 minutes to a few hours

Batch processing

Machine Learning, ETL applications

MapReduce, Hive, Spark

Orchestration under such constraints is a real challenge!
Orchestration in Big Data world
Orchestration in Big Data world: requirements

- Support of various sources
Web
Media
Transactions
- Reliable
- Distributed
- Fault-tolerant
- Scalable
- Reproducible
- Flexible, customizable
- Monitoring, restatement
Orchestration in Big Data world: major players

Crontab
AWS Data Pipeline
Jenkins
Azkaban
Orchestration in Big Data world: our case

Our case
60 applications in 5 projects
Codebase age: 1 month to 5 years

Existing solution
Mix of various orchestration tools (cron, luigi, oozie, in-house scheduler), spread across a few clusters

Issues
Maintainability
Fault-tolerance
Monitoring, restatement

Desired goal
Unification
Reliable tooling
Ease of support for duty engineers
Transparent deployments and versioning

Solution
Apache Airflow? Let's try!
Introduction to Apache Airflow
Apache Airflow: Overview

"Platform to programmatically author, schedule, and monitor workflows"

Vocabulary
- DAGs (directed acyclic graphs) = pipelines
- Tasks, task instances
- Operators, Sensors
- DAG run
- Master, worker nodes

Features
- Batch-oriented
- Define pipelines (DAGs) as Python code
- Dynamic DAGs support
- Extensible
- Parameterizing with Jinja templates
- Scalable
- Rich UI

[Screenshot: Tree View of example_bash_operator with BashOperator and DummyOperator tasks, including run_this_last]
Apache Airflow: Overview
DAGs

[Screenshot: Airflow DAGs list (columns: DAG, Schedule, Owner, Recent Tasks, Last Run, DAG Runs) showing example_bash_operator, example_branch_operator, example_xcom and latest_only, followed by the Graph View of an example branch DAG (BranchPythonOperator, DummyOperator) and the task instance actions dialog]

Apache Airflow: Concepts

- Operators and Sensors
BashOperator
PythonOperator
PostgresOperator
SparkSubmitOperator
BaseBranchOperator, TriggerDagRunOperator, SubDagOperator
S3PrefixSensor
- Pools, Queues
- Hooks (HDFSHook, SlackHook), Connections, Variables, XComs, etc.
Simple DAG

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator

    # default_args is defined elsewhere (not shown on the slide)
    dag = DAG(
        'simple_dag_example',
        default_args=default_args,
        description='A simple tutorial DAG',
        schedule_interval='*/5 * * * *',  # every five minutes
    )

    # Task 1: run the shell command `date`
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag
    )

    def print_hello():
        print('Hello!')

    # Task 2: call a Python function
    t2 = PythonOperator(
        task_id='print_hello',
        python_callable=print_hello,
        dag=dag
    )

[Screenshot: Tree View of simple_dag_example with the BashOperator task print_date and the PythonOperator task print_hello]
Simple DAG

[Screenshot: Graph View of simple_dag_example ("A simple tutorial DAG") with tasks print_date (BashOperator) and print_hello (PythonOperator), plus the "Log by attempts" view for print_hello]

Log excerpt (print_hello, first attempt of 2):

    *** Reading local file: /usr/local/airflow/logs/simple_dag_example/print_hello/2019-12-01T00:00:00+00:00/1.log
    ...
    [2019-12-01 00:05:14,312] {{python_operator.py:105}} INFO - Exporting the following env vars:
    AIRFLOW_CTX_DAG_ID=simple_dag_example
    AIRFLOW_CTX_TASK_ID=print_hello
    AIRFLOW_CTX_EXECUTION_DATE=2019-12-01T00:00:00+00:00
    AIRFLOW_CTX_DAG_RUN_ID=scheduled__2019-12-01T00:00:00+00:00
    [2019-12-01 00:05:14,313] {{logging_mixin.py:95}} INFO - Hello!
    [2019-12-01 00:05:14,313] {{python_operator.py:114}} INFO - Done. Returned value was: None


Apache Airflow in 
Production System 
Airflow in production: checklist


1. Business logic 
Business logic
Example of production DAGs

ML models training workflow
- Wait for new raw training data
- Process training data, add to training dataset
- Launch ML model training
- Release new model

ML scoring workflow (batch predictions)
- Wait for input batches to score (S3)
- Trigger scoring DAG for each input
  - Spark scoring with appropriate models
  - Metrics export
- Mark batches as done

[Diagram: Labeled data -> Data preparation -> Models training -> Models deployment; Input data for scoring -> Data preparation -> Models applying -> Scored data]
Airflow in production: checklist


2. Infrastructure 
Infrastructure

Example of Airflow setup
- 3 Airflow masters (dev/staging/prod environments)
- Each master has:
Airflow scheduler, webserver
20 Airflow workers (CeleryExecutor)
180 DAGs
5 projects
- Python virtualenvs
Airflow runs the scheduler and executes tasks under virtualenvs
  - scheduler env
  - workers env
PythonVirtualenvOperator

[Diagram: Master Node with Airflow Scheduler, Airflow Webserver and Celery Executor; a message queue (label truncated in the source); Celery Workers on Worker Node 1 ... Worker Node N]
Infrastructure
Challenges of Airflow migrations

Migration from Airflow 1.8 to 1.10.10
Migration from Python 2.7 to 3.6

Future: Airflow 2.0 migration
Airflow 2.0 loses backward compatibility with 1.x versions
Infrastructure
Airflow Integration
- Executors
SequentialExecutor
LocalExecutor
CeleryExecutor
KubernetesExecutor
- Cloud integrations (Operators, Hooks)
Azure
AWS
GCP
- Command Line Interface
- REST API (experimental)
- Managed Airflow as a service
Astronomer
GCP, AWS
Airflow in production: checklist


3. Flexibility 
Flexibility
Configuration: code vs variables

- Pipeline's configuration: redeployment required
- Airflow variables

[Screenshot: Airflow Variables list (columns: Key, Val) with keys secret, password, passwd, api_key, apikey, authorization and access_token]
Flexibility
Dynamic tasks

    t_0 = PythonOperator(task_id='some_task', python_callable=foo, dag=dag)

    # Create task_1 ... task_9, each running after some_task
    for idx in range(1, 10):
        t_i = PythonOperator(task_id=f'task_{idx}', python_callable=foo, dag=dag)
        t_i.set_upstream(t_0)

[Screenshot: graph of some_task with its generated downstream tasks]
Flexibility
Dynamic DAGs

    def create_dag(dag_id):
        dag = DAG(dag_id, default_args=default_args, schedule_interval=None)
        task = PythonOperator(task_id='some_task', python_callable=foo, dag=dag)
        return dag

    # Register dynamic_dag_1 ... dynamic_dag_9 in the module namespace
    for idx in range(1, 10):
        dag_id = f'dynamic_dag_{idx}'
        globals()[dag_id] = create_dag(dag_id)

[Screenshot: DAGs list showing dynamic_dag_1 to dynamic_dag_9, owner airflow, last runs on 2019-11-30 at 23:44-23:45]
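
The reason for writing into globals() is that Airflow collects DAG objects it finds at the top level of a DAG file. The sketch below imitates that discovery step without importing Airflow: FakeDag, register_dags and discover_dags are hypothetical stand-ins introduced only for illustration, and a plain dict plays the role of the module namespace.

```python
class FakeDag:
    """Stand-in for airflow.DAG, used only to keep this sketch dependency-free."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


def create_dag(dag_id):
    """Factory mirroring create_dag from the slide (tasks omitted here)."""
    return FakeDag(dag_id)


def register_dags(namespace, count=9):
    """Bind one generated DAG per name into the given namespace."""
    for idx in range(1, count + 1):
        dag_id = f"dynamic_dag_{idx}"
        namespace[dag_id] = create_dag(dag_id)


def discover_dags(namespace):
    """Collect the ids of DAG-like objects stored as top-level values."""
    return sorted(v.dag_id for v in namespace.values() if isinstance(v, FakeDag))


module_namespace = {}  # plays the role of globals() in a real DAG file
register_dags(module_namespace)
print(discover_dags(module_namespace))
```

A DAG that is created inside a function but never bound to a top-level name would not be visible to such a scan, which is why the slide assigns the result of create_dag back into globals().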
Airflow in production: checklist


4. Testing 
Testing

Local development
- Airflow Docker image
- Airflow setup

Unit testing
- Test that the DAG "compiles"
- Test the structure of the DAG's tasks
- Test Operators and tasks

End-to-end testing with REST API

Separate DAGs for integration testing

    import os
    from datetime import datetime
    from airflow.models import DagBag, TaskInstance

    # Structure test: the DAG imports cleanly and task_1 feeds task_2
    def test_dag(dag_id, task_id_1, task_id_2):
        dag_bag = DagBag(dag_folder=dag_folder, executor='LocalExecutor')
        assert len(dag_bag.import_errors) == 0
        assert dag_id in dag_bag.dags
        dag = dag_bag.get_dag(dag_id)
        task_1 = dag.get_task(task_id_1)
        assert task_1
        downstream_tasks = [t.task_id for t in task_1.downstream_list]
        assert downstream_tasks == [task_id_2]

    # Task test: execute a single task and check its side effect
    task = pipeline.dag.task_dict['move_file']
    task.templates_dict = dict(src_path=src_path, dst_path=dst_path)
    ti = TaskInstance(task=task, execution_date=datetime.now())
    task.execute(ti.get_template_context())
    assert os.path.exists(dst_path)
Airflow in production: checklist


5. Deployment


Grid Dynamics / Apache Airflow


Deployment
DAG versioning
- New DAG after update
create DAG with new name
manage actual version
- Modifying existing DAG
new tasks
updates of running DAGs

[Screenshot: DAGs list with simple_dag_example_v2.0 and simple_dag_example_v2.1 (owner airflow), and a Tree View with tasks print_another_hello, print_hello and print_date]
Deployment
DAG management

Deployment
update artifact (pipeline and dependencies) on master and worker hosts

Issues
pause DAG scheduling
wait for completion of dag_runs
deploy new version
resume DAG scheduling
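
The four steps above (pause, wait, deploy, resume) can be sketched as a small driver function. This is a hedged illustration, not the deck's actual tooling: deploy_new_version and all of its callables are hypothetical, and in a real setup they would wrap whatever mechanism is used to pause DAGs, query running dag_runs and ship the artifact.

```python
import time


def deploy_new_version(dag_id, pause, count_active_runs, deploy, resume,
                       poll_seconds=30, sleep=time.sleep):
    """Run the pause -> wait -> deploy -> resume sequence for one DAG.

    All callables are injected, so the sequence can be exercised with fakes.
    """
    pause(dag_id)                          # 1. stop scheduling new runs
    while count_active_runs(dag_id) > 0:   # 2. wait for running dag_runs
        sleep(poll_seconds)
    deploy(dag_id)                         # 3. update artifact on all hosts
    resume(dag_id)                         # 4. resume scheduling


# Usage with in-memory fakes instead of a real Airflow installation.
calls = []
remaining_runs = [2, 1, 0]  # pretend two runs finish while we wait
deploy_new_version(
    "simple_dag_example",
    pause=lambda d: calls.append(("pause", d)),
    count_active_runs=lambda d: remaining_runs.pop(0),
    deploy=lambda d: calls.append(("deploy", d)),
    resume=lambda d: calls.append(("resume", d)),
    sleep=lambda s: calls.append(("sleep", s)),
)
print(calls)
```

Injecting the callables keeps the ordering logic testable; whether scheduling should also be resumed after a failed deploy is a policy decision this sketch leaves open.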
Airflow in production: checklist


6. Optimisation 
Optimisation
Execution tuning via airflow.cfg

parallelism: maximum number of task instances running concurrently across the whole Airflow installation
concurrency: number of task instances allowed to run concurrently for a DAG

max_active_runs: number of running DagRuns for a DAG

celery configurations

scheduler configurations
max_threads
scheduler_heartbeat_sec

.airflowignore
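
As a rough illustration of where these knobs live, the sketch below parses an airflow.cfg-style fragment with the standard library. The option names follow the Airflow 1.10 configuration file as far as I know (dag_concurrency and max_active_runs_per_dag are the installation-wide defaults behind the per-DAG concurrency and max_active_runs settings); the numbers are illustrative only, not recommendations, and load_tuning is a hypothetical helper.

```python
import configparser

# Illustrative fragment in airflow.cfg format; values are examples only.
AIRFLOW_CFG_SAMPLE = """
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 4

[celery]
worker_concurrency = 16

[scheduler]
max_threads = 2
scheduler_heartbeat_sec = 5
"""


def load_tuning(cfg_text):
    """Parse the fragment into a flat {'section.option': int} mapping."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        f"{section}.{key}": parser.getint(section, key)
        for section in parser.sections()
        for key in parser[section]
    }


for name, value in load_tuning(AIRFLOW_CFG_SAMPLE).items():
    print(f"{name} = {value}")
```

Keeping such a fragment under version control next to the pipelines makes tuning changes reviewable in the same way as DAG code.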
Airflow in production: checklist


7. Monitoring


Monitoring

- Airflow sends emails on DAG failures or retries
- Tasks monitoring:
sla - time by which the job is expected to finish
execution_timeout - max time allowed for the execution of this task instance
retries, retry_delay - multiple attempts for task execution
- DAGs monitoring:
dagrun_timeout - max time allowed for the execution of a DAG run
sla_miss_callback - a function to call when reporting SLA timeouts
- SLAs view
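
The settings listed above are ordinary keyword arguments: the task-level ones usually travel in default_args, the DAG-level ones go to the DAG constructor. The sketch below only builds the dictionaries and does not import Airflow; the e-mail address, the durations and the on_sla_miss callback are hypothetical, chosen for illustration.

```python
from datetime import timedelta


def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Hypothetical callback invoked when an SLA is missed."""
    print(f"SLA missed in {dag}: {task_list}")


# Task-level settings, typically shared by all tasks through default_args.
default_args = {
    "email": ["oncall@example.com"],          # placeholder address
    "email_on_failure": True,
    "email_on_retry": True,
    "retries": 2,                             # attempts after the first failure
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(hours=1),  # per task instance
    "sla": timedelta(hours=2),                # expected completion time
}

# DAG-level settings, passed to the DAG constructor as DAG(dag_id, **dag_kwargs).
dag_kwargs = {
    "default_args": default_args,
    "dagrun_timeout": timedelta(hours=3),
    "sla_miss_callback": on_sla_miss,
}

print(sorted(dag_kwargs))
```

Choosing dagrun_timeout larger than the task-level sla, as here, lets an SLA miss be reported before the whole run is timed out.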




But what if we need even more?
Apache Airflow Extensions


Extensions mechanism

Custom operators, sensors, hooks
Custom views, UI elements
Plugins for integration and auto-importing
YAML Config

Issues
Hardcoded values
Code duplication
Overcomplicated pipeline's code

    # base_config.yaml
    dag:
      start_date: !datetime 2019-01-01 00:00
      email: !airflow_variable mailing_list
      execution_date: !execution_date
      output: !from_xcom
        key: output
      env: !override_required

    # dev_config.yaml
    include: !include_parsed base_config.yaml
    env: !override
      key: include.dag.env
      value: dev
Autodocs

Issues
Outdated documentation for duty engineers
Lack of standardization

[Screenshot: generated pipeline documentation, with the content below]

Input/Output Storage

    Environment  Storage Type  Storage URL  Data TTL, Days  Data Volume, GB
    prod         output        /prod/out    1               10

Pipeline's Summary
This pipeline moves data from the local FS to the database.

Airflow DAG
airflow-dag-example-pipeline

Pipeline autodocs
Main pipeline code

class pipeline.CustomOperator(*args, **kwargs)
This is first dummy operator's custom operator class

class pipeline.SomeOtherOperator(*args, **kwargs)
Custom operator for some other operator
Monitoring

Issue
State of DAGs in case of Airflow outage

Goal
Provide reliable external monitoring system

Solution
Dump logs from scheduler, tasks to Elasticsearch
Airflow database
Grafana dashboards
Zabbix hooks to handle tasks completion

[Screenshot: Grafana dashboards over 12/1 to 12/3 with panels "[DAGs] success", "[DAGs] failed", "[DAGs] [Cumulative] running", "[DAGs] running", "[DAGs] Total running", "[DAGs] Long-running" (columns: count, last start date, dag id) and "[DAGs] Failed" (column: start date)]

Conclusion

- Apache Airflow works successfully as an orchestration tool and has initiated the unification of tools in our project
- Apache Airflow is:
scalable, fault-tolerant
flexible and customizable
mature enough to handle production workloads
backed by a good community
- In our project it significantly reduced the number of production incidents
- Grid Dynamics actively contributes to Airflow

Next steps

- Migration from on-prem to AWS
Airflow on Kubernetes
Common library for AWS pipelines
Self-service for infrastructure provisioning and scaling
- Migration to Airflow 2.0
Resources

https://airflow.apache.or

https://github.com/apache/airflow
https://stackoverflow.com/questions/tagged/airflow

Common Pitfalls:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62694614

Official Slack workspace: https://apache-airflow.slack.com/

Russian Telegram community (1200 members): https://t.me/ruairflow
Thank you!


